Abstract. Let $A$ be an $n \times n$ matrix, $X$ be an $n \times p$ matrix and $Y = AX$. A challenging and
important problem in data analysis, motivated by dictionary learning and other practical problems,
is to recover both $A$ and $X$, given $Y$. Under normal circumstances, it is clear that this problem
is underdetermined. However, in the case when $X$ is sparse and random, Spielman, Wang and
Wright showed that one can recover both $A$ and $X$ efficiently from $Y$ with high probability, given
that $p$ (the number of samples) is sufficiently large. Their method works for $p > Cn^2 \log^2 n$ and
they conjectured that $p > Cn \log n$ suffices. The bound $n \log n$ is sharp for an obvious information
theoretical reason.

In this paper, we show that $p > Cn \log^4 n$ suffices, matching the conjectural bound up to a
polylogarithmic factor. The core of our proof is a theorem concerning $\ell_1$ concentration of random
matrices, which is of independent interest.

Our proof of the concentration result is based on two ideas. The first is an economical way to 
apply the union bound. The second is a refined version of Bernstein’s concentration inequality for 
the sum of independent variables. Both have nothing to do with random matrices and are applicable 
in general settings. 
1. Introduction

Let $A$ be an $n \times n$ invertible matrix and $X$ be an $n \times p$ matrix; set $Y := AX$. The aim of this
paper is to study the following recovery problem: 

Given Y, reconstruct A and X. 

It is clear that in the equation 

(1.1) Y = AX, 

we have $n^2 + np$ unknowns (the entries of $A$ and $X$), and only $np$ equations (given by the entries of
$Y$). Thus, the problem is underdetermined and one cannot hope for a unique solution. However,
in practice, $X$ is frequently a sparse matrix. If $X$ is sparse, the number of unknowns decreases
dramatically, as the majority of entries of $X$ are zero. The name of the game here is to find the
minimum value of $p$, the number of observations, which guarantees a unique recovery (see, e.g., [2]).

One real-life application that motivates the studies of this problem is dictionary learning. The 
matrix A can be seen as a hidden dictionary, with its columns being the words. X is a sparse sample 
matrix. This means that in the columns of Y we observe linear combinations of a few columns 
of A. From these observations, we would like to recover the dictionary. An archetypal example is 
facial recognition m SO]. A database of observed faces is used to generate the dictionary and once 
the dictionary is found, the problem of storing and transmitting facial images can be done very 
efficiently, as all one needs is to store and transmit few coefficients. In fact, such dictionary-learning 
techniques can be utilized to recognize faces that are partially occluded or corrupted with noise HZ). 
For more discussion and real-life examples, we refer to [9], [12] and the references therein. Another
practical situation in which the recovery problem appears essential is blind source separation and 
we refer the reader to [20] for more details. 

There have been many approaches to efficient recovery beginning with the work of m- Let us 
mention, among others, online dictionary learning by m, SIV |7], the relative Newton method 
for source separation by m, the Method of Optimal Directions by [3], K-SVD in [T], and scalable 
variants in m- 

While various different approaches have been considered, there have not been many rigorous 
results concerning performance. The first such result was obtained by Spielman, Wang and
Wright [15] concerning recovery with random samples; in other words, $X$ is a random sparse matrix.
Before stating their result, we need to discuss the meaning of unique and the random model. First,
notice that if $Y = AX$, then $Y = (AV)(V^{-1}X)$ for any diagonal matrix $V$ with non-zero diagonal
entries. Furthermore, one can freely permute the columns of A and the rows of X accordingly while 
keeping Y the same. In the rest of the paper, unique recovery will be understood modulo these two 
operations. 

To model $X$, one considers random Bernoulli-subgaussian matrices, defined as follows: $X$ is a
matrix of size $n \times p$ with iid entries $X_{ij}$, where

(1.2)    $X_{ij} := \chi_{ij}\xi_{ij},$

where $\chi_{ij}$ are iid indicator random variables with $P(\chi_{ij} = 1) = \theta$ and $\xi_{ij}$ are iid random variables with
mean 0, variance bounded by 1,

    $E|\xi_{ij}| \in [1/10, 1],$

and

    $P(|\xi_{ij}| > t) \le 2\exp(-t^2/2).$

This model includes many important distributions such as the standard Gaussians and Rademachers.
The 1/10 is introduced for convenience of analysis and is not critical to the argument.
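To make the model (1.2) concrete, the following minimal sketch samples a Bernoulli-subgaussian matrix using only the Python standard library. The function name `sample_bernoulli_subgaussian` and its parameters are illustrative choices, not notation from [15]; the two supported laws for $\xi_{ij}$ (Rademacher and standard Gaussian) are the examples named above.

```python
import random

def sample_bernoulli_subgaussian(n, p, theta, rng, xi="rademacher"):
    """Return an n x p matrix (list of rows) with entries chi_ij * xi_ij.

    chi_ij is an indicator with P(chi_ij = 1) = theta; xi_ij is either a
    Rademacher sign or a standard Gaussian (both fit the model above).
    """
    def draw_xi():
        if xi == "rademacher":
            return rng.choice((-1.0, 1.0))
        return rng.gauss(0.0, 1.0)

    return [[draw_xi() if rng.random() < theta else 0.0 for _ in range(p)]
            for _ in range(n)]

rng = random.Random(0)
X = sample_bernoulli_subgaussian(n=50, p=400, theta=0.1, rng=rng)
# Empirical fraction of non-zero entries; it should be close to theta.
density = sum(x != 0.0 for row in X for x in row) / (50 * 400)
print(f"empirical density of non-zeros: {density:.3f}")
```

The matrix is stored as a list of rows, so `X[j][i]` plays the role of $X_{ji}$.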

Spielman et al. proved

Theorem 1.1. There are constants $C > 0$, $C' > 0$ such that the following holds. Let $A$ be an
invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le C'/\sqrt{n}$ and $\xi_{ij}$
having a symmetric distribution. Then for $p > Cn^2 \log^2 n$, one can efficiently find a solution with
probability $1 - o(1)$.

Here and later, efficient means polynomial time. The algorithm designed for this purpose is called
ER-SpUD, whose main subroutine is $\ell_1$ optimization. We are going to present and discuss this
algorithm in Section 4. In the dictionary learning problem, $p$ is the number of measurements, and
it is important to optimize its value. From below, it is easy to see that we must have $p > cn \log n$
for some constant $c > 0$. Indeed, if $\theta = 2/n$ (or $c'/n$ for any constant $c'$) and $p < cn \log n$ for
a sufficiently small constant $c$, then the coupon collector argument shows that with probability
$1 - o(1)$, $X$ has an all-zero row. In this case, changing the corresponding column of $A$ will not affect
$Y$, and a unique recovery is hopeless. Spielman et al. conjecture

Conjecture 1.2. There are constants $C > 0$, $\alpha > 0$ such that the following holds. Let $A$ be an
invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le \alpha/\sqrt{n}$. Then for
$p > Cn \log n$, one can efficiently find a solution with probability $1 - o(1)$.

As a matter of fact, they believe that ER-SpUD should perform well as long as $p > Cn \log n$, for
some large constant $C$. They also proved that if one does not care about the running time of the
algorithm, then $p > Cn \log n$ suffices.
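The coupon collector lower bound mentioned before Conjecture 1.2 can be checked numerically. The sketch below computes the exact probability that $X$ has an all-zero row, using only that the $np$ indicators are independent with mean $\theta$; the function name and the particular values of $n$ and of the constant in front of $n \log n$ are illustrative assumptions.

```python
import math

def prob_some_zero_row(n, p, theta):
    """P(at least one of the n rows of X is identically zero)."""
    q = (1.0 - theta) ** p                  # P(a fixed row has no non-zero entry)
    # Rows are independent, so the answer is 1 - (1 - q)^n (computed stably).
    return -math.expm1(n * math.log1p(-q))

n = 10_000
theta = 2.0 / n
few = prob_some_zero_row(n, int(0.25 * n * math.log(n)), theta)
many = prob_some_zero_row(n, int(2.0 * n * math.log(n)), theta)
print(f"p = 0.25 n log n: {few:.3f};  p = 2 n log n: {many:.2e}")
```

With $\theta = 2/n$ the expected number of all-zero rows is about $n e^{-2p/n}$, which is why the transition happens at $p$ of order $n \log n$.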

The analysis in [15] boils down to the concentration problem. For a vector $v \in \mathbb{R}^n$, let $\mu_v :=
E\|X^Tv\|_1$. Let $c$ be a small positive constant ($c = .1$ suffices) and let $Bad(v)$ be the event that
$|\|X^Tv\|_1 - \mu_v| \ge c\mu_v$. We want to have

(1.3)    $P\big(\bigcup_{v \in \mathbb{R}^n} Bad(v)\big) = o(1).$

In other words, with high probability, $\|X^Tv\|_1$ does not deviate significantly from its mean,
simultaneously for all $v \in \mathbb{R}^n$.

One needs to find the smallest value of $p$ which guarantees (1.3). Notice that $\|X^Tv\|_1$ is the sum
of $p$ iid random variables $|X_iv|$, where $X_i$ are the rows of $X^T$ (that is, the columns of $X$). Thus, intuitively the larger $p$ is, the
more $\|X^Tv\|_1$ concentrates. From below, we observe that (1.3) fails if $p < n - 1$, since in this case
for any matrix $X$ one can find a $v$ such that $X^Tv = 0$ and $\mu_v > 1$ (we can take $v$ arbitrarily long).
Spielman, Wang, and Wright [15] showed that $p > Cn^2 \log^2 n$ suffices. We will prove


Theorem 1.3. For any constant $c > 0$ there is a constant $C > 0$ such that (1.3) holds for any
$p > Cn \log^4 n$.
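As a sanity check on the intuition that $\|X^Tv\|_1$ concentrates as $p$ grows, here is a small Monte Carlo sketch. It only looks at one fixed vector $v$ (the normalized all-one vector) with Rademacher $\xi_{ij}$, so it illustrates pointwise concentration and says nothing about the uniform statement (1.3); all names and parameter values are illustrative assumptions.

```python
import random
import statistics

def l1_norm_XTv(n, p, theta, v, rng):
    """Sample X column by column and return ||X^T v||_1 (Rademacher xi_ij)."""
    total = 0.0
    for _ in range(p):                       # one column of X at a time
        s = 0.0
        for j in range(n):
            if rng.random() < theta:         # chi_ij = 1
                s += rng.choice((-1.0, 1.0)) * v[j]
        total += abs(s)
    return total

rng = random.Random(1)
n, theta = 40, 0.1
v = [1.0 / n] * n                            # unit l_1 norm

def rel_spread(p, trials=30):
    """Empirical (standard deviation / mean) of ||X^T v||_1 over fresh samples of X."""
    vals = [l1_norm_XTv(n, p, theta, v, rng) for _ in range(trials)]
    return statistics.pstdev(vals) / statistics.fmean(vals)

spread_small, spread_large = rel_spread(50), rel_spread(800)
print(f"relative spread: p=50 -> {spread_small:.3f}, p=800 -> {spread_large:.3f}")
```

Since $\|X^Tv\|_1$ is a sum of $p$ iid terms, the relative spread should shrink roughly like $p^{-1/2}$.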


Beyond the current application, Theorem 1.3 may be of independent interest for several reasons.
While concentration inequalities for random matrices are abundant, most of them concern the
spectral or $\ell_2$ norm. We have not seen one which addresses the $\ell_1$ norm as in this theorem. As
sparsity plays a crucial role in data analysis, techniques involving the $\ell_1$ norm (such as $\ell_1$ optimization)
become more and more important. Furthermore, in the proof we introduce two general ideas, which
seem to be applicable in many settings. The first is an economical way to apply the union bound
and the second is a refined version of Bernstein's concentration inequality for sums of independent
variables.


Using Theorem 1.3, we are able to give an improved analysis of ER-SpUD, which yields 


Theorem 1.4. There are constants $C > 0$, $C' > 0$ such that the following holds. Let $A$ be an
invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le C'/\sqrt{n}$. Then for
$p > Cn \log^4 n$, one can efficiently find a solution with probability $1 - o(1)$.

Our $p$ is within a $\log^3 n$ factor of the bound in Conjecture 1.2. Furthermore, we can drop the
assumption that the $\xi_{ij}$ are symmetric from Theorem 1.1.
Next, we will be able to refine Theorem 1.3 in two ways. First, combining the proof of Theorem
1.4 with a result from random matrix theory, we obtain the following more general result, which
handles the case when $A$ is rectangular.

Theorem 1.5. There are constants $C, \alpha > 0$ such that the following holds. Let $n > m$ and $A$ be
an $n \times m$ matrix of rank $m$ and $X$ a sparse random $m \times p$ matrix with $2/n \le \theta \le \alpha/\sqrt{n}$. Then
for $p > Cn \log^4 n$, one can efficiently find a solution with probability $1 - o(1)$.

Second, in the sparsest case $\theta = O(1/n)$, we develop a new algorithm that obtains the optimal
bound $p = Cn \log n$, proving Conjecture 1.2 in this regime.

Theorem 1.6. For any $c > 0$ there is a constant $C > 0$ such that the following holds. Let $A$ be an
invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $\theta = c/n$. Then for $p > Cn \log n$,
one can efficiently find a solution with probability $1 - o(1)$.

Finally, let us mention the issue of theoretical recovery, regardless of the running time. Without
the complexity issue, Spielman et al. showed that $p > Cn \log n$ suffices, given that the random
variable $\xi_{ij}$ in the definition of $X$ has a symmetric distribution. We can strengthen this theorem
by removing this assumption.

Theorem 1.7. There are constants $C > 0$, $C' > 0$ such that the following holds. Let $A$ be an
invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le C'/\sqrt{n}$. Then for
$p > Cn \log n$, one can find a solution with probability $1 - o(1)$.


The rest of the paper is organized as follows. In Section 2 we present the main ideas behind the
proof of Theorem 1.3. The details follow in Section 3. Section 4 contains the accompanying
algorithms and an improved analysis of ER-SpUD, following [15]. Section 5 addresses a generalization
to rectangular dictionaries. The subsequent sections introduce a new algorithm that achieves the optimal
bound in the sparse regime, prove Theorem 1.7, and conclude with some numerical experiments
on the various algorithms.


Acknowledgement. We would like to thank D. Spielman for bringing the problem to our attention. 


2. The main ideas and lemmas 

2.1. The standard $\varepsilon$-net argument. Let us recall our task. For a vector $v \in \mathbb{R}^n$ let $\mu_v :=
E\|X^Tv\|_1$. Let $c$ be a small positive constant ($c = .1$ suffices) and let $Bad(v)$ be the event that
$|\|X^Tv\|_1 - \mu_v| \ge c\mu_v$. We want to show that if $p$ is sufficiently large, then

(2.1)    $P\big(\bigcup_{v \in \mathbb{R}^n} Bad(v)\big) = o(1).$

For the sake of presentation, let us assume that the random variables $\xi_{ij}$ are Rademacher (taking
values $\pm 1$ with probability 1/2); the entries $X_{ij}$ of $X$ have the form $X_{ij} = \chi_{ij}\xi_{ij}$, where $\chi_{ij}$ are iid
indicator variables with mean $\theta$. We start with a quick proof of the bound $p > Cn^2 \log^2 n$ obtained
in [15]. Notice that the union in (1.3) contains infinitely many terms. The standard way to handle
this is to use an $\varepsilon$-net argument.


Definition 2.1. A set $\mathcal{N} \subset \mathbb{R}^n$ is an $\varepsilon$-net of a set $D \subset \mathbb{R}^n$ in $\ell_q$ norm, for some $0 < q \le \infty$, if for
any $x \in D$ there is $y \in \mathcal{N}$ so that $\|x - y\|_q \le \varepsilon$. The unit sphere in $\ell_q$ norm consists of vectors $v$
where $\|v\|_q = 1$. $B$ denotes the unit sphere in $\ell_1$ norm.

Considering the vectors in $B$ is sufficient to prove the result. It is easy to show that for any
$v \in B$

    $\mu_{\min} := p\sqrt{\theta/n} \le \mu_v \le p\theta =: \mu_{\max},$

where the lower bound is attained at $v = n^{-1}\mathbf{1}$ ($\mathbf{1}$ is the all-one vector) and the upper bound at
$v = (1, 0, \dots, 0)$. Let $\mathcal{N}_0$ be the set of all vectors in $B$ whose coordinates are integer multiples of
$n^{-3}$. Any vector in $B$ would be of distance at most $n^{-2}$ in $\ell_1$ norm from some vector in $\mathcal{N}_0$ (thus
$\mathcal{N}_0$ is an $n^{-2}$-net of $B$). A short consideration shows that if $u, v \in B$ are within $n^{-2}$ of each other,
then (recall that the entries of $X$ are bounded by 1 in absolute value)

    $\big|\|X^Tu\|_1 - \|X^Tv\|_1\big| \le \|X^T(u - v)\|_1 \le p\|u - v\|_1 \le pn^{-2} = o(\mu_{\min}).$




Thus, to prove (1.3), it suffices to show that

(2.2)    $P\big(\bigcup_{v \in \mathcal{N}_0} Bad(v)\big) = o(1).$

In order to bound $P\big(\bigcup_{v \in \mathcal{N}_0} Bad(v)\big)$, let us first bound $P(Bad(v))$ for any $v \in B$. Notice that

    $\|X^Tv\|_1 = \sum_{i=1}^{p} |X_iv|,$

where $X_i$ is the $i$-th column of $X$, viewed as a row vector (the $i$-th row of $X^T$). The random variables $|X_iv|$ are iid, and one is poised to apply
another standard tool, Bernstein's inequality for the sum of independent random variables.

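Lemma 2.2. Let $Z_1, \dots, Z_n$ be independent random variables such that $|Z_i| \le \tau$ with probability
1. Let $S := \sum_{i=1}^{n} Z_i$. Then for any $T > 0$

    $\max\{P(S - ES \le -T),\ P(S - ES \ge T)\} \le \exp\Big(-\frac{T^2}{2(\mathrm{Var}\,S + \tau T)}\Big) \le \exp\Big(-\min\Big\{\frac{T^2}{4\,\mathrm{Var}\,S}, \frac{T}{4\tau}\Big\}\Big).$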
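For orientation, the sketch below compares the textbook form of Bernstein's inequality, $\exp(-T^2/(2(\mathrm{Var}\,S + \tau T/3)))$ for summands with $|Z_i - EZ_i| \le \tau$, with the two-regime form $\exp(-\min\{T^2/(4\,\mathrm{Var}\,S), T/(4\tau)\})$ that is used in the computations of this section. The textbook form is an assumption brought in for comparison, not a quotation of Lemma 2.2; the function names and the parameter grid are illustrative.

```python
import math

def bernstein_tail(T, var_S, tau):
    """Textbook Bernstein bound on P(S - ES >= T), summands bounded by tau."""
    return math.exp(-T * T / (2.0 * (var_S + tau * T / 3.0)))

def two_regime_tail(T, var_S, tau):
    """Weaker but more transparent form exp(-min{T^2/(4 Var S), T/(4 tau)})."""
    return math.exp(-min(T * T / (4.0 * var_S), T / (4.0 * tau)))

grid = [(T, V, tau) for T in (0.5, 2.0, 10.0, 50.0)
                    for V in (0.1, 1.0, 25.0)
                    for tau in (0.01, 1.0, 3.0)]
# The two-regime form should never be smaller than the textbook bound.
dominated = all(bernstein_tail(*g) <= two_regime_tail(*g) for g in grid)
print("two-regime form dominates on the grid:", dominated)
```

The first term in the minimum is the Gaussian regime (governed by the variance), the second the regime governed by the almost-sure bound $\tau$; it is the second one that the refined argument of Section 2.2 improves.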

In our case $Z_i = |X_iv| = \big|\sum_{j=1}^{n} \chi_{ij}\xi_{ij}v_j\big|$. Since we assume that the $\xi_{ij}$ are Rademacher,

    $|Z_i| \le \sum_{j=1}^{n} |v_j| = \|v\|_1 = 1$

with probability 1. This means we can set $\tau = 1$. Furthermore

    $\mathrm{Var} \sum_{i=1}^{p} Z_i = p\,\mathrm{Var}\,Z_1 \le pE|X_1v|^2 = p\sum_{j=1}^{n} \theta v_j^2 \le p\theta \sum_{j=1}^{n} |v_j| = p\theta.$

Finally, one can set $T = c\mu_{\min} = cp\sqrt{\theta/n}$. Lemma 2.2 implies that

    $P(Bad(v)) \le 2\exp\Big(-\min\Big\{\frac{c^2p^2\theta/n}{4p\theta}, \frac{cp\sqrt{\theta/n}}{4}\Big\}\Big) \le 2\exp\Big(-\frac{c^2p}{4n}\Big),$

since $\sqrt{\theta/n} \ge 1/n$ as $\theta \ge 1/n$.

Using the union bound

(2.3)    $P\big(\bigcup_{v \in \mathcal{N}_0} Bad(v)\big) \le \sum_{v \in \mathcal{N}_0} P(Bad(v)),$

we obtain

    $P\big(\bigcup_{v \in \mathcal{N}_0} Bad(v)\big) \le |\mathcal{N}_0| \times 2\exp\Big(-\frac{c^2p}{4n}\Big).$

It is easy to check that $|\mathcal{N}_0| = \exp(\Omega(n \log n))$. So, in order to make the RHS $o(1)$, we need
$p > Cn^2 \log n$ for a sufficiently large constant $C$. For the case when $\xi_{ij}$ are not Bernoulli (but still
subgaussian) the calculation in [15] requires an extra logarithm term, which results in the bound
$p > Cn^2 \log^2 n$.
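The quadratic dependence on $n$ in the naive argument can be seen by plugging in numbers. The sketch below assumes a per-vector bound of the form $2\exp(-c^2p/(4n))$ and a net of size at most $\exp(4n \log n)$ (the size bound (3.1) used later for $\mathcal{N}_0$), and solves for the smallest $p$ that makes the union bound small. The target failure probability $1/100$ and the function names are illustrative assumptions.

```python
import math

def log_union_bound(n, p, c=0.1):
    """log of |N_0| * 2 exp(-c^2 p / (4n)), using |N_0| <= exp(4 n log n)."""
    return 4 * n * math.log(n) + math.log(2) - c * c * p / (4 * n)

def samples_needed(n, c=0.1, log_target=-math.log(100)):
    """Smallest p for which the union bound is at most exp(log_target)."""
    return math.ceil(4 * n / (c * c)
                     * (4 * n * math.log(n) + math.log(2) - log_target))

n = 1000
p_star = samples_needed(n)
ratio = p_star / (n * n * math.log(n))
print(f"p needed: {p_star:.3e}, i.e. about {ratio:.0f} * n^2 log n")
```

The ratio is essentially $16/c^2$, independent of $n$: the net has $\exp(\Theta(n \log n))$ points while each point only contributes $\exp(-\Theta(p/n))$, which forces $p$ of order $n^2 \log n$.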

2.2. New ingredients. Our first idea is to find a more efficient variant of the union bound

    $P\big(\bigcup_{v \in \mathcal{N}_0} Bad(v)\big) \le \sum_{v \in \mathcal{N}_0} P(Bad(v)).$

Motivated by the inclusion-exclusion formula, we try to capture some gain when $P(Bad(u) \cap
Bad(v))$ is large for many pairs $u, v$. We observe that there is a gain if we can group the elements $v$ of the net into
clusters so that within each cluster, the events $Bad(v)$ (seen as subsets of the underlying probability
space) are close to each other. Let $p_0$ be an upper bound on $P(Bad(v))$ for $v \in \mathcal{N}_0$. Assume, for a moment, that one can split the net $\mathcal{N}_0$ into $m$ disjoint
clusters $C_i$, $1 \le i \le m$, so that if $u$ and $v$ belong to the same cluster then $P(Bad(u) \setminus Bad(v)) \le p_1$,
where $p_1$ is much smaller than $p_0$. Then

    $P\big(\bigcup_{v \in C_i} Bad(v)\big) \le P(Bad(u^{(i)})) + |C_i|p_1,$

where $u^{(i)}$ is a representative point in $C_i$. Summing over $i$, one obtains

(2.4)    $P\big(\bigcup_{v \in \mathcal{N}_0} Bad(v)\big) \le \sum_{i=1}^{m} P\big(\bigcup_{v \in C_i} Bad(v)\big) \le \sum_{i=1}^{m} P(Bad(u^{(i)})) + |\mathcal{N}_0|p_1 \le mp_0 + |\mathcal{N}_0|p_1.$

We gain significantly if $p_1$ is much smaller than $p_0$ and $m$ is much smaller than $|\mathcal{N}_0|$. Next,
viewing the set of representatives as a new net $\mathcal{N}_1$, we can iterate the argument, obtaining the
following lemma.

Lemma 2.3. Let $\mathcal{V}$ be a probability space. Let $\mathcal{N} = \mathcal{N}_0$ be a finite set, where to each element
$v \in \mathcal{N}_0$ we associate a set $Bad_0(v) \subset \mathcal{V}$. Assume that we can construct a sequence of sets

    $\mathcal{N}_L, \mathcal{N}_{L-1}, \dots, \mathcal{N}_0,$

and for each $u \in \mathcal{N}_l$, $1 \le l \le L$, an event $Bad_l(u)$ such that the following holds. For each $v \in \mathcal{N}_{l-1}$,
there is $u \in \mathcal{N}_l$ such that $P(Bad_{l-1}(v) \setminus Bad_l(u)) \le p_l$, and for each $u \in \mathcal{N}_L$, $P(Bad_L(u)) \le p_0$.
Then

(2.5)    $P\big(\bigcup_{v \in \mathcal{N}_0} Bad_0(v)\big) \le |\mathcal{N}_L|p_0 + \sum_{l=1}^{L} |\mathcal{N}_{l-1}|p_l.$

The construction of the sets $\mathcal{N}_l$ is of critical importance, and we are going to construct them using the
$\ell_\infty$ distance, rather than the obvious choice of $\ell_1$. This is the key point of our method.
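The one-step clustered bound (2.4) can be verified on a toy example. The sketch below builds a finite uniform probability space, three clusters of events that are small perturbations of a common core, and compares the true probability of the union with the clustered bound and with the naive union bound. The sizes, the seed, and the choice of the first event of each cluster as its representative are arbitrary illustrative assumptions.

```python
import random

rng = random.Random(2)
OMEGA = 2000                                   # uniform finite probability space

def prob(event):
    return len(event) / OMEGA

# Three clusters; events inside a cluster share a 40-point core and
# differ from it by at most 2 extra points.
clusters = []
for _ in range(3):
    core = set(rng.sample(range(OMEGA), 40))
    clusters.append([core | set(rng.sample(range(OMEGA), 2)) for _ in range(50)])

union = set().union(*(e for c in clusters for e in c))
naive = sum(prob(e) for c in clusters for e in c)
# Clustered bound: P(representative) + sum of P(Bad(v) \ Bad(representative)).
clustered = sum(prob(c[0]) + sum(prob(e - c[0]) for e in c) for c in clusters)

print(f"P(union) = {prob(union):.3f}, clustered bound = {clustered:.3f}, "
      f"naive bound = {naive:.3f}")
```

Here the naive bound exceeds 1 and is useless, while the clustered bound stays informative because the difference events inside a cluster are tiny.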

The next main technical ingredient is a more efficient way of using Bernstein's inequality, Lemma
2.2. Recall the bound

(2.6)    $P(|S - ES| \ge T) \le 2\exp\Big(-\frac{T^2}{2(\mathrm{Var}\,S + \tau T)}\Big) \le 2\exp\Big(-\min\Big\{\frac{T^2}{4\,\mathrm{Var}\,S}, \frac{T}{4\tau}\Big\}\Big).$

The first term in the rightmost formula is usually optimal. However, we need to improve
the second term. The idea is to replace $\tau$ with a smaller quantity $\tau'$ such that the probability that
$|Z_i| \le \tau'$ is close to 1. Let us illustrate this idea with the upper tail. Set $\mu := ES$; we consider

    $P(S \ge \mu + T).$

Write

    $Z_i = Z_iJ_i + Z_iI_i,$

where $J_i$ is the indicator of the event $|Z_i| \le \tau'$ and $I_i = 1 - J_i$. Thus

    $S = \sum_i Z_iJ_i + \sum_i Z_iI_i =: Q + S(1).$

Let $\mu_1$ and $\mu_2$ be the expectations of $Q$ and $S(1)$, respectively. Then

    $P(S \ge \mu + T) \le P(Q \ge \mu_1 + T/2) + P(S(1) \ge \mu_2 + T/2).$

We can use Lemma 2.2 to bound $P(Q \ge \mu_1 + T/2)$, which provides a bound better than (2.6)
as now $\tau' < \tau$. On the other hand, if the probability that $|Z_i| > \tau'$ is small, then we can bound
$P(S(1) \ge \mu_2 + T/2)$ in a different way, exploiting the fact that there will be very few non-zero
summands in $S(1)$.

We can (and have to) further refine this idea by considering a sequence of $\tau'$, breaking $S$ into the
sum of $Q$ and $S(k)$, $1 \le k \le M$, for a properly chosen $M$. This will be our leading idea to bound
the difference probability $p_l$ in the next section.
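A numerical illustration of why the split helps is sketched below under simplifying assumptions that are not part of the argument above: the summands are non-negative on the rare event $\{Z_i > \tau'\}$, each occurs there with probability $q$, the shift between the means of $Q$ and $S$ is ignored, and the two-regime form $\exp(-\min\{T^2/(4V), T/(4\tau)\})$ is used for the Bernstein step. All parameter values and function names are hypothetical.

```python
import math

def single_scale(T, V, tau):
    """Bernstein applied directly with the worst-case bound tau."""
    return math.exp(-min(T * T / (4 * V), T / (4 * tau)))

def two_scale(T, V, tau, tau_small, q, p):
    """Split each Z_i at level tau_small (simplified, see assumptions above).

    Bulk part: Bernstein with deviation T/2 and the smaller bound tau_small.
    Rare part: the large summands are at most tau each, so their sum can
    reach T/2 only if at least m of the p summands are large.
    """
    bulk = math.exp(-min((T / 2) ** 2 / (4 * V), (T / 2) / (4 * tau_small)))
    m = math.ceil((T / 2) / tau)
    rare = (p * q) ** m            # union bound: C(p, m) q^m <= (p q)^m
    return bulk + rare

params = dict(T=4.0, V=0.0101, tau=1.0)
old = single_scale(**params)
new = two_scale(**params, tau_small=0.01, q=1e-8, p=10_000)
print(f"single-scale bound: {old:.3f}, two-scale bound: {new:.2e}")
```

In this toy regime the direct bound is stuck at $e^{-T/(4\tau)}$, whereas after the split both pieces are tiny: the bulk because $\tau'$ is small, the rare part because several rare events would have to occur simultaneously.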

On the abstract level, our method bears a similarity to the chaining argument from the theory
of Banach spaces. We are going to discuss this point in Section 3.7.


3. Proof of Theorem 1.3

For the sake of presentation, we assume that $X_{ij} = \chi_{ij}\xi_{ij}$ where $\chi_{ij}$ are iid Bernoulli random
variables with mean $\theta$ and $\xi_{ij}$ are iid Rademacher random variables. In fact, $p > Cn \log^3 n$ is
sufficient for the Rademacher case. The proof can be easily modified for $\xi_{ij}$ being general
sub-gaussian at the cost of a $\sqrt{\log n}$ factor in the bound for $p$ (see Section 3.6). We recall the notation
$\mu_{\min} = p\sqrt{\theta/n}$, $\mu_{\max} = p\theta$, $\mu_v := E\|X^Tv\|_1$. $B$ is the set of all vectors of unit $\ell_1$ norm.

We set $p = Cn \log^3 n$, for a sufficiently large constant $C$. Let $T := c_0\mu_{\min}/\log n$ for a small constant
$c_0 > 0$ and $K := 6p\theta/T$.


3.1. $\alpha$-nets in $\ell_\infty$ norm.

Lemma 3.1. For any $1 \ge \alpha \ge 2/n$, $B$ admits an $\alpha$-net in $\ell_\infty$ norm of size at most $\exp(2\alpha^{-1} \log n)$.

Proof. Let $\mathcal{N}$ be the collection of all vectors $v \in B$ whose coordinates are integer multiples of $\alpha$.
Obviously, $\mathcal{N}$ is an $\alpha$-net of $B$ in $\ell_\infty$ norm. Furthermore, any $v \in \mathcal{N}$ satisfies $\|v\|_1 \le 1$, so it has at
most $k := \alpha^{-1}$ non-zero coordinates. If a coordinate is non-zero, it can take at most $2\alpha^{-1} + 1 \le 3k$
values. Therefore,

    $|\mathcal{N}| \le \sum_{i=0}^{k} \binom{n}{i}(3k)^i.$

As $\alpha \ge 2/n$, the RHS is at most

    $n\binom{n}{k}(3k)^k \le n\Big(\frac{en}{k} \times 3k\Big)^k = n(3en)^k \le \exp(2\alpha^{-1} \log n).$

□

The key here is that we consider an $\alpha$-net in $\ell_\infty$ norm, rather than in $\ell_1$ norm, which appears to
be a natural choice.
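The counting bound in the proof of Lemma 3.1 is easy to evaluate exactly. The following sketch computes $\log \sum_{i \le k} \binom{n}{i}(3k)^i$ for $\alpha = 1/k$ and compares it with the exponent $2\alpha^{-1} \log n$ claimed in the lemma; the choice $n = 100$ and the list of values of $k$ are illustrative.

```python
import math

def log_net_size(n, k):
    """log of sum_{i<=k} C(n, i) (3k)^i, the count from the proof (alpha = 1/k)."""
    return math.log(sum(math.comb(n, i) * (3 * k) ** i for i in range(k + 1)))

n = 100
rows = [(k, log_net_size(n, k), 2 * k * math.log(n)) for k in (1, 2, 5, 10, 25, 50)]
for k, counted, claimed in rows:
    print(f"alpha = 1/{k:<2d}  log(count) = {counted:7.1f}   2 k log n = {claimed:7.1f}")
```

For comparison, a net in $\ell_1$ norm at a comparable scale has $\exp(\Theta(n \log n))$ points regardless of $\alpha$; the $\ell_\infty$ net is far smaller as soon as $\alpha^{-1} \ll n$, which is where the gain in the chained union bound comes from.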
3.2. Building a nested sequence. Recall that $\mathcal{N}_0$ is the set of vectors $v$ in $B$ whose coordinates
are integer multiples of $n^{-3}$. We have

(3.1)    $|\mathcal{N}_0| \le (2n^3 + 1)^n \le \exp(4n \log n).$

Consider the sequence $\alpha_0 = 2/n$; $\alpha_l = 2\alpha_{l-1}$ for $l = 1, \dots, L$, where $L \le \log_2 n$ is the first index
such that $\alpha_L \ge 1/2$. Let $\mathcal{N}'_l$ be an $\alpha_l$-net of $B$ in the $\ell_\infty$ norm. By Lemma 3.1 we can choose $\mathcal{N}'_l$
such that

(3.2)    $|\mathcal{N}'_l| \le \exp(2\alpha_l^{-1} \log n).$

We now build a nested sequence $\mathcal{N}_L \subset \mathcal{N}_{L-1} \subset \cdots \subset \mathcal{N}_1 \subset \mathcal{N}_0$ as follows. Assume that $\mathcal{N}_{l-1}$
has been built. Use the points in $\mathcal{N}'_l$ as centers to construct a Voronoi partition of the points of
$\mathcal{N}_{l-1}$ with respect to the $\ell_\infty$ norm (ties are broken arbitrarily). For each point $u \in \mathcal{N}'_l$ let $C_u$ be
the subset of $\mathcal{N}_{l-1}$ that corresponds to $u$. By definition, $\|u - v\|_\infty \le \alpha_l$ for any $v \in C_u$.

Partition the interval $[\mu_{\min}, \mu_{\max}] = [p\sqrt{\theta/n}, p\theta]$ into $K$ intervals $I_1, \dots, I_K$ of equal lengths.
We partition $C_u$ further into $K$ subsets $C_{u,j}$, $1 \le j \le K$, where $v \in C_{u,j}$ if $E\|X^Tv\|_1 \in I_j$. By this
construction, if $v, w$ belong to the same $C_{u,j}$, then by the definition of $K$, we have the key relations

(3.3)    $\|v - w\|_\infty \le 2\alpha_l$ and $\big|E\|X^Tv\|_1 - E\|X^Tw\|_1\big| \le p\theta/K \le T/6.$

From each set $C_{u,j}$ choose an arbitrary element $v$. Thus, each $u \in \mathcal{N}'_l$ gives rise to a set $R_u$ of $K$
elements ($R$ stands for representative). Define

    $\mathcal{N}_l := \bigcup_{u \in \mathcal{N}'_l} R_u.$

It is clear that $\mathcal{N}_l \subset \mathcal{N}_{l-1}$ and

(3.4)    $|\mathcal{N}_l| \le K|\mathcal{N}'_l| \le K\exp(2\alpha_l^{-1} \log n).$
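The doubling sequence of scales is simple enough to tabulate. The sketch below generates $\alpha_0 = 2/n$, $\alpha_l = 2\alpha_{l-1}$ up to the first $\alpha_L \ge 1/2$ and lists the exponents $2\alpha_l^{-1} \log n$ from (3.2); the value $n = 1024$ and the function name are illustrative.

```python
import math

def scales(n):
    """alpha_0 = 2/n, alpha_l = 2 alpha_{l-1}, stopped at the first alpha_L >= 1/2."""
    alphas = [2.0 / n]
    while alphas[-1] < 0.5:
        alphas.append(2.0 * alphas[-1])
    return alphas

n = 1024
alphas = scales(n)
L = len(alphas) - 1
log_sizes = [2.0 / a * math.log(n) for a in alphas]   # exponents in (3.2)
for l, (a, s) in enumerate(zip(alphas, log_sizes)):
    print(f"l = {l}:  alpha_l = {a:.5f}   log-size bound = {s:9.1f}")
```

The exponents halve at every level, from about $n \log n$ at the finest scale down to $O(\log n)$ at level $L$, and there are only $L \le \log_2 n$ levels.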


3.3. Bounding the differences. Consider the construction of $\mathcal{N}_l$, $1 \le l \le L$, from Section 3.2.
Let $v \in \mathcal{N}_l$. Thus, $v \in C_{u,j}$ for some $u \in \mathcal{N}'_l$ and $1 \le j \le K$. Consider another point $w \in C_{u,j}$.
Our main task is to show

Lemma 3.2. For all pairs $v, w$ as above

(3.5)    $\rho(v, w) := P\big(\big|\|X^Tv\|_1 - \|X^Tw\|_1\big| \ge T\big) \le \exp(-5\alpha_l^{-1} \log n).$

The rest of this section is devoted to the proof of this lemma. By (3.3), we have

(3.6)    $\|v - w\|_\infty \le 2\alpha_l$ and $\big|E\|X^Tv\|_1 - E\|X^Tw\|_1\big| \le p\theta/K \le T/6.$

Define $Z_i = |X_iv| - |X_iw|$, where $X_i$ is the $i$-th row of $X^T$; we have

    $\|X^Tv\|_1 - \|X^Tw\|_1 = \sum_{i=1}^{p} (|X_iv| - |X_iw|) = \sum_{i=1}^{p} Z_i.$

Set $S := \sum_{i=1}^{p} Z_i$; by symmetry, it suffices to bound

    $P(Z_1 + \cdots + Z_p \ge T) =: P(S \ge T).$

Notice that by the triangle inequality

    $|Z_i| = \big||X_iv| - |X_iw|\big| \le |X_i(v - w)|.$

Therefore

    $\mathrm{Var}\,Z_i \le EZ_i^2 \le E|X_i(v - w)|^2 = \theta \sum_{j=1}^{n} (v_j - w_j)^2.$

Recall that $\|v\|_1, \|w\|_1 \le 1$ and $\|v - w\|_\infty \le \alpha_l$. Therefore

    $\sum_{j=1}^{n} (v_j - w_j)^2 \le \alpha_l \sum_{j=1}^{n} (|v_j| + |w_j|) \le 2\alpha_l.$

This implies

(3.7)    $\mathrm{Var}\,Z_i \le EZ_i^2 \le 2\alpha_l\theta.$


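We denote by $I_{i,k}$ the event that $\tau_k < Z_i \le \tau_{k-1}$ for $k = 1, \dots, M$ and by $J_i$ the event that $|Z_i| \le \tau_M$,
for a sequence $\tau_k$, $k = 0, \dots, M$, where $\tau_0 = 2$, $\tau_k = 2^{-k}\tau_0$ and $M$ is the largest index so that

(3.8)    $\min\Big\{\frac{\tau_M^2}{8\alpha_l\theta}, \frac{\tau_M}{4\alpha_l}\Big\} \ge 8 \log n.$

Note that if $\alpha_l \le \frac{1}{32} \log^{-1} n$ then such an index $M \ge 1$ exists. We will proceed with this
assumption and cover the remaining cases at the end of the proof. Apparently, identifying each event with its indicator,

    $Z_i \le \sum_{k=1}^{M} Z_iI_{i,k} + Z_iJ_i.$

Set $S(k) = \sum_{i=1}^{p} Z_iI_{i,k}$ for $k = 1, \dots, M$ and $Q = \sum_{i=1}^{p} Z_iJ_i$. We have

    $P(S \ge T) \le P(Q \ge T/2) + \sum_{k=1}^{M} P\Big(S(k) \ge \frac{T}{2M}\Big).$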

To bound $P(Q \ge T/2)$, we notice that (see (3.11)) the choice of $\tau_M$ guarantees that $P(J_i) \ge
1 - 2n^{-8}$ for all $i = 1, \dots, p$. As $|Z_i| \le 2$ with probability 1, it follows that

    $|EZ_iJ_i - EZ_i| \le 4n^{-8}$

and so

    $|EQ - ES| \le 4pn^{-8} = o(n^{-6}),$

as $p = o(n^2)$. On the other hand, by (3.6), $T \ge 5(ES + n^{-6})$. Thus

    $P(Q \ge T/2) \le P(Q \ge EQ + T/4).$

By definition, $Q$ is a sum of $p$ iid random variables, each bounded by $\tau_M$ in absolute value with
probability 1. Furthermore, by (3.7)

    $\mathrm{Var}\,Q = p\,\mathrm{Var}\,Z_1J_1 \le pEZ_1^2 \le 2\alpha_l\theta p.$

By Lemma 2.2, we have

(3.9)    $P(Q \ge EQ + T/4) \le 2\exp\Big(-\min\Big\{\frac{(T/4)^2}{8\alpha_l\theta p}, \frac{T/4}{4\tau_M}\Big\}\Big) = 2\exp\Big(-\min\Big\{\frac{T^2}{128\alpha_l\theta p}, \frac{T}{16\tau_M}\Big\}\Big).$
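Now we bound $P(S(k) \ge \frac{T}{2M})$, for $k = 1, \dots, M$. Recall that $S(k) := \sum_{i=1}^{p} Z_iI_{i,k}$ is a sum of iid
non-negative random variables, each of which is either 0 or in $(\tau_k, \tau_{k-1}]$. Thus, if $S(k) \ge T/2M$ there
must be at least $m_k := \frac{T/2M}{\tau_{k-1}}$ indices $i$ such that $Z_i \ge \tau_k$. Let $\rho_k$ be the probability that $Z_1 \ge \tau_k$.
Then by the union bound and the fact that $p = o(n^2)$,

(3.10)    $P\Big(S(k) \ge \frac{T}{2M}\Big) \le \binom{p}{m_k}\rho_k^{m_k} \le (n^2\rho_k)^{m_k}.$

To complete the analysis, we need to estimate $\rho_k$. By definition

    $\rho_k := P(|X_iv| - |X_iw| \ge \tau_k) \le P(|X_i(v - w)| \ge \tau_k).$

The random variable $\tilde{Z}_i := X_i(v - w) = \sum_{j=1}^{n} \chi_{ij}\xi_{ij}(v_j - w_j)$ has mean 0. Furthermore, by (3.7),
$\mathrm{Var}\,\tilde{Z}_i \le E\tilde{Z}_i^2 \le 2\alpha_l\theta$. Finally, each term $\chi_{ij}\xi_{ij}(v_j - w_j)$ is at most $\alpha_l$ in absolute value. Thus Lemma
2.2 implies

(3.11)    $\rho_k \le P(|\tilde{Z}_i| \ge \tau_k) \le 2\exp\Big(-\min\Big\{\frac{\tau_k^2}{8\alpha_l\theta}, \frac{\tau_k}{4\alpha_l}\Big\}\Big).$

This and (3.10) yield

(3.12)    $P\Big(S(k) \ge \frac{T}{2M}\Big) \le 2\exp\Big(-\Big(\min\Big\{\frac{\tau_k^2}{8\alpha_l\theta}, \frac{\tau_k}{4\alpha_l}\Big\} - 2 \log n\Big)m_k\Big).$

By (3.8), $\min\{\frac{\tau_k^2}{8\alpha_l\theta}, \frac{\tau_k}{4\alpha_l}\} \ge 8 \log n$, so

    $\min\Big\{\frac{\tau_k^2}{8\alpha_l\theta}, \frac{\tau_k}{4\alpha_l}\Big\} - 2 \log n \ge \frac{1}{2}\min\Big\{\frac{\tau_k^2}{8\alpha_l\theta}, \frac{\tau_k}{4\alpha_l}\Big\}.$

By definition $m_k = \frac{T}{4M\tau_k}$, as $\tau_{k-1} = 2\tau_k$. Therefore,

    $\frac{1}{2} \cdot \frac{\tau_k^2}{8\alpha_l\theta}\,m_k = \frac{\tau_kT}{64M\alpha_l\theta}$

and

    $\frac{1}{2} \cdot \frac{\tau_k}{4\alpha_l}\,m_k = \frac{T}{32M\alpha_l}.$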

By (3.9) and (3.12), we conclude that

(3.13)    $P(S \ge T) \le 2\exp\Big(-\min\Big\{\frac{T^2}{128\alpha_l\theta p}, \frac{T}{16\tau_M}\Big\}\Big) + \sum_{k=1}^{M} 2\exp\Big(-\min\Big\{\frac{\tau_kT}{64M\alpha_l\theta}, \frac{T}{32M\alpha_l}\Big\}\Big).$

A routine verification (see Section 3.5) shows that once $p > Cn \log^3 n$ for a sufficiently large
constant $C$, the RHS in (3.13) is at most $\exp(-5\alpha_l^{-1} \log n)$, completing the proof for the case
$\alpha_l \le \frac{1}{32} \log^{-1} n$.

To complete the proof, we now treat the remaining case when $\alpha_l > \frac{1}{32} \log^{-1} n$. In this case, we
do not need to split $Z_i$. Recall $S = Z_1 + \cdots + Z_p$ where $|Z_i| \le 2$ with probability 1, $ES \le T/6$ and
$\mathrm{Var}\,S \le 2p\theta\alpha_l$. By Lemma 2.2, we have

    $P(S \ge T) \le P(S \ge ES + T/2) \le \exp\Big(-\min\Big\{\frac{T^2}{32p\theta\alpha_l}, \frac{T}{8}\Big\}\Big).$

By the analysis of (3.13), we already know that $\frac{T^2}{32p\theta\alpha_l} \ge 5\alpha_l^{-1} \log n$. On the other hand, as
$\alpha_l \ge \frac{1}{32} \log^{-1} n$,

    $\frac{T}{8} = \frac{c_0p\sqrt{\theta/n}}{8 \log n} = \frac{c_0C}{8}\sqrt{\theta n}\,\log^2 n \ge 5\alpha_l^{-1} \log n,$

given that $c_0C$ is sufficiently large. This completes the proof.


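3.4. Proof of the Concentration lemma. For $v \in \mathcal{N}_l$, $0 \le l \le L$, let $Bad_l(v)$ be the event that
$\big|\|X^Tv\|_1 - \mu_v\big| \ge 2(L + 1 - l)T$. For $l = 0$, $2(L + 1 - l)T = 2(L + 1)T \le \frac{2c_0(\log_2 n + 1)}{\log n}\mu_{\min} \le 4c_0\mu_{\min}$.
Thus,

    $P\Big(\bigcup_{v \in \mathcal{N}_0} \big\{\big|\|X^Tv\|_1 - \mu_v\big| \ge 4c_0\mu_{\min}\big\}\Big) \le P\Big(\bigcup_{v \in \mathcal{N}_0} Bad_0(v)\Big).$

Assume that there is a number $p_0$ such that $P(Bad_L(v)) \le p_0$ for all $v \in \mathcal{N}_L$. Assume furthermore
that for any $1 \le l \le L$, there is a number $p_l$ such that for $v \in \mathcal{N}_l$ and $w \in \mathcal{N}_{l-1}$ where $v$ is the
representative of the set that contains $w$ (see the construction in Section 3.2)

    $P(Bad_{l-1}(w) \setminus Bad_l(v)) \le p_l.$

Then by Lemma 2.3

    $P\Big(\bigcup_{v \in \mathcal{N}_0} Bad_0(v)\Big) \le |\mathcal{N}_L|p_0 + \sum_{l=1}^{L} |\mathcal{N}_{l-1}|p_l.$

To find $p_l$, notice that if $Bad_{l-1}(w)$ holds and $Bad_l(v)$ does not, then $\big|\|X^Tw\|_1 - \mu_w\big| \ge 2(L +
2 - l)T$ and $\big|\|X^Tv\|_1 - \mu_v\big| < 2(L + 1 - l)T$. By (3.3), $|\mu_v - \mu_w| \le T$. It thus follows that

    $\big|\|X^Tw\|_1 - \|X^Tv\|_1\big| \ge T.$

By the main lemma of Section 3.3 (Lemma 3.2), we know that the probability of this event is at most $p_l :=
\exp(-5\alpha_l^{-1} \log n)$, for all $l$. Recall from Section 3.2 that

    $|\mathcal{N}_{l-1}| \le K\exp(2\alpha_{l-1}^{-1} \log n) = K\exp(4\alpha_l^{-1} \log n);$

we have

    $\sum_{l=1}^{L} |\mathcal{N}_{l-1}|p_l \le \sum_{l=1}^{L} \exp(-5\alpha_l^{-1} \log n) \times K\exp(4\alpha_l^{-1} \log n).$

Since $K = O(n^{1/2})$ and $\alpha_l^{-1} \log n \ge \log n$, the RHS is at most

    $\sum_{l=1}^{L} \exp(-.5\alpha_l^{-1} \log n) = o(1).$

To conclude, notice that by Lemma 2.2 we can set $p_0 := 2\exp\big(-\min\big\{\frac{T^2}{4p\theta}, \frac{T}{4}\big\}\big)$. As $|\mathcal{N}_L| \le
K\exp(2\alpha_L^{-1} \log n) \le K\exp(4 \log n)$ since $\alpha_L \ge 1/2$, we have

    $p_0|\mathcal{N}_L| = o(1),$

as long as $\min\big\{\frac{T^2}{4p\theta}, \frac{T}{4}\big\} \ge 5 \log n$. This condition holds if $p > Cn \log^3 n$ for a sufficiently large
constant $C$. This implies that

    $P\Big(\bigcup_{v \in \mathcal{N}_0} \big\{\big|\|X^Tv\|_1 - \mu_v\big| \ge 4c_0\mu_{\min}\big\}\Big) = o(1),$

and we are done by (2.2).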
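3.5. The magnitude of p. We present the routine verification concerning the exponents in (3.13).
This is the only place where the magnitude of $p$ matters. Recall that $T = c_0p\sqrt{\theta/n}/\log n$ and
$p = Cn \log^3 n$ (since for the sake of exposition we are only considering the Rademacher case). We
have

    $\frac{T^2}{128\alpha_l\theta p} = \frac{c_0^2p^2\theta/n}{128\alpha_l\theta p \log^2 n} = \frac{c_0^2C}{128}\alpha_l^{-1} \log n \ge 4.1\alpha_l^{-1} \log n,$

provided that $c_0^2C/128 \ge 4.1$.

By the definition of $M$ in (3.8), we have

    $32 \log n \ge \min\Big\{\frac{\tau_M^2}{8\alpha_l\theta}, \frac{\tau_M}{4\alpha_l}\Big\} \ge 8 \log n.$

This implies that

    $\tau_M \le \max\{16\sqrt{\alpha_l\theta \log n},\ 128\alpha_l \log n\}.$

It follows that

    $\frac{T}{16\tau_M} \ge \min\Big\{\frac{T}{256\sqrt{\alpha_l\theta \log n}}, \frac{T}{2048\alpha_l \log n}\Big\}.$

By the definition of $p$ and $T$

    $\frac{T}{256\sqrt{\alpha_l\theta \log n}} = \frac{c_0Cn \log^2 n\sqrt{\theta/n}}{256\sqrt{\alpha_l\theta \log n}} = \frac{c_0C}{256}\sqrt{n/\alpha_l}\,\log^{3/2} n \ge 4.1\alpha_l^{-1} \log n,$

since $c_0C/256 \ge 4.1$ and $n\alpha_l \ge n\alpha_0 > 1$. Furthermore,

    $\frac{T}{2048\alpha_l \log n} = \frac{c_0Cn \log^2 n\sqrt{\theta/n}}{2048\alpha_l \log n} = \frac{c_0C}{2048}\sqrt{\theta n}\,\alpha_l^{-1} \log n \ge 4.1\alpha_l^{-1} \log n,$

provided that $c_0C/2048 \ge 4.1$, since $\theta n \ge 1$.

Next, we bound the exponent $\frac{T}{32M\alpha_l}$. As $M \le \log n$, we have

    $\frac{T}{32M\alpha_l} \ge \frac{c_0Cn \log^2 n\sqrt{\theta/n}}{32\alpha_l \log n} = \alpha_l^{-1}\frac{c_0C}{32}\sqrt{\theta n}\,\log n \ge 4.1\alpha_l^{-1} \log n,$

provided that $c_0C/32 \ge 4.1$, since $\theta n \ge 1$.

Finally, we bound the exponent $\frac{\tau_kT}{64M\alpha_l\theta}$. By definition $\tau_k \ge 8\sqrt{\alpha_l\theta \log n}$ and $M \le \log n$, thus

    $\frac{\tau_kT}{64M\alpha_l\theta} \ge \frac{8\sqrt{\alpha_l\theta \log n}\,T}{64 \log n\,\alpha_l\theta} = \frac{c_0C}{8}\sqrt{n/\alpha_l}\,\log^{3/2} n \ge 4.1\alpha_l^{-1} \log n,$

concluding the proof.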


3.6. Extension from Rademacher to general sub-gaussian variables. We introduce the
truncation operator $T_\tau : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ as

    $(T_\tau[M])_{ij} = M_{ij}$ if $|M_{ij}| \le \tau$, and $(T_\tau[M])_{ij} = 0$ otherwise.

Let $\tau = \sqrt{C \log n}$ and let

    $X' = T_\tau[X].$

For $C$ sufficiently large, the probability that $X' = X$ is $1 - o(1)$. This allows us to work with a
random matrix whose entries are bounded by $\tau$ (instead of 1 as in the Rademacher case). The
same proof will go through if we increase $p$ by a factor of $C_1\tau$, for a sufficiently large constant $C_1$. This means
$p = O(n \log^{3.5} n)$ suffices. We round 3.5 up to 4 for cosmetic reasons.
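The truncation step is straightforward to express in code. The sketch below implements the entrywise truncation and applies it to a sparse Gaussian sample with $\tau = \sqrt{C \log n}$; the constant $C = 8$, the dimensions, and the helper name `truncate` are illustrative assumptions.

```python
import math
import random

def truncate(M, tau):
    """Entrywise truncation: keep M_ij if |M_ij| <= tau, replace it by 0 otherwise."""
    return [[x if abs(x) <= tau else 0.0 for x in row] for row in M]

rng = random.Random(3)
n, p, theta, C = 30, 200, 0.2, 8.0
# Sparse matrix with standard Gaussian non-zero entries.
X = [[rng.gauss(0.0, 1.0) if rng.random() < theta else 0.0 for _ in range(p)]
     for _ in range(n)]
tau = math.sqrt(C * math.log(n))
largest = max(abs(x) for row in X for x in row)
unchanged = truncate(X, tau) == X
print(f"tau = {tau:.2f}, largest |X_ij| = {largest:.2f}, X' == X: {unchanged}")
```

Truncation changes the matrix only if some entry exceeds $\tau$ in absolute value, which by the sub-gaussian tail and a union bound over the $np$ entries has probability $o(1)$ once $C$ is large.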
3.7. Concluding remarks. There is a connection between the method of our proof and Fernique's
chaining argument [5] (see [16] for a survey). The goal of the chaining method is to bound the
supremum $\sup_{t \in B} X_t$ where $B$ is a domain in a metric space and $X_t$ is a Gaussian process. In this
case, the bad event $Bad(v)$ can roughly be defined as $X_v > M$, for some candidate value $M$. One
then considers a chain of sets in order to bound $P\big(\bigcup_{v \in B} Bad(v)\big)$. This, in spirit, is similar to the
purpose of Lemma 2.3.

After this, the arguments become different in all aspects. First, in our setting, the bad event
$Bad(v)$ can have any nature. Next, in the chaining argument, the sets $\mathcal{N}_l$ are defined using the
metric of $B$, while in our case, it is crucial to use a different metric. We construct $\mathcal{N}_l$ using the $\ell_\infty$
norm, rather than the natural $\ell_1$ norm used to define the domain $B$. Finally, in the chaining case
it is easy to bound $P(Bad(u) \setminus Bad(v))$, using the fact that $P(|X_u - X_v| \ge t) \le 2\exp\big(-\frac{t^2}{2d(u,v)^2}\big)$, where $d(u,v)^2 := E|X_u - X_v|^2$,
which is the basic property of a Gaussian process. In our case, bounding $P(Bad(u) \setminus Bad(v))$ is an
essential step (Lemma 3.2), which requires the development of the refined Bernstein's inequality.


4. The algorithm and concentration of random matrices 


As the algorithm and analysis are discussed extensively in [15], we will be brief and the readers
can consult [15] for more details. [15] introduces the dictionary learning algorithm ER-SpUD. The
key insight in the design of ER-SpUD is that the rows of $X$ are likely to be the sparsest vectors
in the row space of $Y$. (This observation also appeared in [20] and [8].) [15] proposed to find these
vectors by considering the following optimization problems:

    minimize $\|w^TY\|_1$ subject to $r^Tw = 1$,

where $r$ is a sum of two columns of $Y$.

Using $\ell_1$ optimization for finding sparse vectors is a natural idea, and the authors of [15] pointed
out that such an approach was already proposed in [20] and [8]. The difference is the new constraint
$r^Tw = 1$. (Earlier works used different constraints.)

By a change of variables $z = A^Tw$, $b = A^{-1}r$, we can consider the equivalent problem

(4.1)    minimize $\|z^TX\|_1$ subject to $b^Tz = 1$.

The algorithm presented in [15] is outlined below (for those familiar with [15], note that we are
presenting the two-column version of ER-SpUD):

Algorithm 1 ER-SpUD

1: Randomly pair the columns of $Y$ into $p/2$ groups $g_j = \{Ye_{j_1}, Ye_{j_2}\}$.
2: For $j = 1, \dots, p/2$
     Let $r_j = Ye_{j_1} + Ye_{j_2}$, where $g_j = \{Ye_{j_1}, Ye_{j_2}\}$.
     Solve $\min_w \|w^TY\|_1$ subject to $(r_j)^Tw = 1$, and set $s_j = w^TY$.
3: Use the Greedy algorithm to reconstruct $X$ and $A$.


Algorithm 2 Greedy

1: Require: $S = \{s_1, \dots, s_T\} \subset \mathbb{R}^p$
2: For $i = 1 \dots n$
     REPEAT
       $l \leftarrow \arg\min_{s_l \in S} \|s_l\|_0$, breaking ties arbitrarily
       $x_i = s_l$
       $S = S \setminus \{s_l\}$
     UNTIL $\mathrm{rank}([x_1, \dots, x_i]) = i$
3: Set $X = [x_1, \dots, x_n]^T$, and $A = YY^T(XY^T)^{-1}$
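Step 2 of Algorithm 2 can be sketched in a few lines of Python. The version below is a simplified illustration and not the implementation of [15]: it uses exact comparison with zero to count non-zeros and a small Gaussian elimination with a tolerance for the rank test (both would need care on numerical LP output), it stops with an error if the candidates run out, and it omits the final step $A = YY^T(XY^T)^{-1}$. The helper names `rank` and `greedy` are hypothetical.

```python
def rank(rows, tol=1e-9):
    """Rank of a small dense matrix (list of rows) by Gaussian elimination."""
    m = [list(r) for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        piv = max(range(rk, len(m)), key=lambda r: abs(m[r][col]), default=None)
        if piv is None or abs(m[piv][col]) <= tol:
            continue                      # no usable pivot in this column
        m[rk], m[piv] = m[piv], m[rk]
        for r in range(rk + 1, len(m)):
            f = m[r][col] / m[rk][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk

def greedy(S, n):
    """Pick n linearly independent vectors from S, sparsest first."""
    pool = [list(s) for s in S]
    chosen = []
    for i in range(1, n + 1):
        while True:
            if not pool:
                raise ValueError("ran out of candidate vectors")
            # index of a sparsest remaining candidate (first one on ties)
            l = min(range(len(pool)), key=lambda t: sum(x != 0 for x in pool[t]))
            cand = pool.pop(l)
            if rank(chosen + [cand]) == i:
                chosen.append(cand)
                break
    return chosen

S = [[1, 0, 0, 0], [2, 0, 0, 0], [0, 3, 0, 0], [1, 1, 1, 1], [0, 0, 0, 5]]
X_rows = greedy(S, 3)
print(X_rows)
```

In the example the second candidate is discarded because it is a multiple of the first, and the dense vector is never selected because sparser independent candidates remain.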
A key technical step in analyzing ER-SpUD is the following lemma, which asserts that if $p$ is
sufficiently large, then with high probability $\|X^Tv\|_1$ is close to its mean, simultaneously for all
unit vectors $v \in \mathbb{R}^n$.

Lemma 4.1. For every constant $1 > \delta > 0$ there is a constant $C_0 > 0$ such that the following
holds. If $\theta \ge 1/n$ and $p > C_0n^2 \log^2 n$, then with probability $1 - o(1)$, for all $v \in \mathbb{R}^n$

(4.2)    $\big|\|X^Tv\|_1 - E\|X^Tv\|_1\big| \le \delta E\|X^Tv\|_1.$

This lemma appears implicitly in [15]. Dan Spielman pointed out to us that this would imply
the critical [15, Lemma 17]. The bound $p > Cn^2 \log^2 n$ is of importance in the proof of this lemma.
Our Theorem 1.3, which pushes $p$ to $Cn \log^4 n$, is an improved version of Lemma 4.1.

With Theorem 1.3 in hand, let us now sketch the proof of Theorem 1.4, following the analysis
in [15].

Notice that if the solution of the $\ell_1$ optimization problem, $z_*$, is 1-sparse, then the algorithm
will recover a row of $X$. The proof of the theorem relies on showing that $z_*$ is supported on the
non-zero indices of $b$ and that with high probability, $z_*$ is in fact 1-sparse. The first goal allows
us to focus our attention on a submatrix of $X$ which will be convenient for technical reasons. To
address this first issue, we prove the following.

Lemma 4.2. Suppose that $X$ satisfies the Bernoulli-Subgaussian model. There exists a numerical
constant $C > 0$ such that if $\theta n \ge 2$ and

    $p > Cn \log^4 n$

then the random matrix $X$ has the following property with probability at least $1 - o(1)$.

(P1) For every $b$ satisfying $\|b\|_0 \le 1/(8\theta)$, any solution $z_*$ to the optimization problem (4.1) has
$\mathrm{supp}(z_*) \subset \mathrm{supp}(b)$.


Sketch of the Proof of Lemma 4.2. We let $J$ be the indices of the $s$ non-zero entries of $b$. Let $S$
be the indices of the nonzero columns in $X_J$, and let $z_0 = P_J z_*$ (the restriction to those coordinates
indexed by $J$). Define $z_1 = z_* - z_0$. We demonstrate that $z_0$ has at least as low an objective as $z_*$,
so $z_1$ must be zero. One can show using the triangle inequality that

$\|z_*^T X\|_1 \ge \|z_0^T X\|_1 - 2\|z_1^T X^S\|_1 + \|z_1^T X\|_1.$

Thus, if $\|z_1^T X\|_1 - 2\|z_1^T X^S\|_1 > 0$, then $z_0$ has a lower objective value. We need this inequality to
hold for all $z$ with high probability. Notice that

$\mathbb{E}\big[\|z^T X\|_1 - 2\|z^T X^S\|_1\big] = (p - 2|S|)\, \mathbb{E}|z^T X_1|.$

It is easy to show that $|S| < p/4$ with high probability, so $p - 2|S| > 0$ with high probability.
Therefore, if we can show that $\|z^T X\|_1 - 2\|z^T X^S\|_1$ is concentrated near its positive expectation,
we are done.

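We see that it suffices to show the result for the worst case $|S| = p/4$. Now we make critical use
of Theorem 1.3, which asserts that for a sufficiently small constant $\delta > 0$, with high probability,

$\|z^T X\|_1 \ge (1 - \delta)\, \mathbb{E}\|z^T X\|_1 = (1 - \delta)\, p\, \mathbb{E}|z^T X_1|$

and

$\|z^T X^S\|_1 \le (1 + \delta)\, \mathbb{E}\|z^T X^S\|_1 = (1 + \delta)\, \frac{p}{4}\, \mathbb{E}|z^T X_1|,$

so, taking $\delta < 1/3$,

$\|z^T X\|_1 - 2\|z^T X^S\|_1 \ge \frac{1 - 3\delta}{2}\, p\, \mathbb{E}|z^T X_1| > 0.$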


Having proved Lemma 4.2, the rest of the proof is relatively simple and follows [15] exactly. The
success of the algorithm now depends on the existence of a sufficient gap between the largest and
second largest entry in $b$. The intuition is that if X preserved the $\ell_1$ norm exactly, i.e.
$\|z^T X\|_1 = c\|z\|_1$, then the minimization procedure will output the vector $z$ of smallest $\ell_1$ norm
such that $b^T z = 1$, which is just $e_{j_*}/b_{j_*}$, where $j_*$ is the index of the element of $b$ with the
largest magnitude. However, X only preserves the $\ell_1$ norm in an approximate sense. Yet, the algorithm
will still extract a row of X if there is a significant gap between the largest element of $b$ and the
second largest.
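
The claim that $e_{j_*}/b_{j_*}$ has the smallest $\ell_1$ norm among all $z$ with $b^T z = 1$ follows from Hölder's inequality, $1 = |b^T z| \le \max_j |b_j| \cdot \|z\|_1$. The sketch below checks this numerically for one vector $b$; the function names and the sample vector are illustrative assumptions, not part of the original analysis.

```python
import random


def l1_minimizer(b):
    """Return e_{j*} / b_{j*}, where j* indexes the largest-magnitude entry of b."""
    j_star = max(range(len(b)), key=lambda j: abs(b[j]))
    z = [0.0] * len(b)
    z[j_star] = 1.0 / b[j_star]
    return z


def l1_norm(z):
    return sum(abs(x) for x in z)


def random_feasible(b, rng):
    """Draw a Gaussian vector and rescale it so that b^T z = 1."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in b]
        dot = sum(bi * zi for bi, zi in zip(b, z))
        if abs(dot) > 1e-9:  # resample in the (unlikely) degenerate case
            return [zi / dot for zi in z]


b = [0.3, -2.0, 0.5, 1.1]          # sample vector (an arbitrary choice)
z_star = l1_minimizer(b)            # supported on the index of -2.0
rng = random.Random(0)
samples = [random_feasible(b, rng) for _ in range(1000)]
# No feasible sample should beat z_star in l1 norm (up to rounding).
no_sample_is_better = all(l1_norm(z) >= l1_norm(z_star) - 1e-12 for z in samples)
```

When the largest and second largest entries of $b$ are close in magnitude, several near-optimal 1-sparse candidates exist, which is one way to see why a gap is needed once the norm is only preserved approximately.
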


5. Rectangular dictionaries and Theorem 1.5

We now present a generalization of ER-SpUD, which enables us to deal with rectangular dictionaries.
Consider a full rank matrix A of size n x m, such that n > m, and the equation AX = Y.
To deal with this setting, we first augment A to be a square, n x n, invertible matrix. Of course,
the issue is that one does not know A, and one also needs to figure out how the augmentation changes
the product Y.

We can solve this issue using a random augmentation. For instance, we can use an n x (n - m)
gaussian matrix B to augment A to a square matrix A' (the entries in B are iid standard gaussian).
It is trivial that the augmented matrix has full rank with probability 1, since the probability that
a gaussian vector belongs to any fixed hyperplane is zero. We can also augment X from an m x p
matrix to an n x p matrix X' by an (n - m) x p random matrix Z with entries iid to those of X.
This augmentation process yields a matrix equation


Y' = A'X' 


where Y' = Y + E and E = BZ (Figure 1). In practice, we can first generate B and Z, then compute
E := BZ and construct Y' := Y + E. Next, we apply the ER-SpUD algorithm to the equation
Y' = A'X' to recover A' and X' with high probability. From these two matrices, we can then
deduce A and X.
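
The identity Y' = Y + BZ = A'X' behind this augmentation is just block matrix multiplication, $[A \mid B]$ times X stacked on top of Z. The following Python sketch checks it on a small integer example. It is an illustration only: the function names, the Rademacher choice for B, the sparse sign model for Z and the value of `theta` are assumptions made here, and ER-SpUD itself is not implemented.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def augment(a, x, rng, theta=0.3):
    """Return (A', X', E) with A' = [A | B], X' = [X ; Z] and E = BZ."""
    n, m, p = len(a), len(a[0]), len(x[0])
    # B: n x (n - m) with iid Rademacher (+1/-1) entries.
    b = [[rng.choice((-1, 1)) for _ in range(n - m)] for _ in range(n)]
    # Z: (n - m) x p; as a stand-in for "the model of X", each entry is
    # nonzero with probability theta (an assumed value) and then +1/-1.
    z = [[rng.choice((-1, 1)) if rng.random() < theta else 0 for _ in range(p)]
         for _ in range(n - m)]
    a_aug = [row_a + row_b for row_a, row_b in zip(a, b)]
    x_aug = x + z
    return a_aug, x_aug, matmul(b, z)


rng = random.Random(1)
A = [[1, 2], [0, 1], [3, -1]]          # n = 3, m = 2
X = [[1, 0, 0, -1], [0, 0, 1, 1]]      # m x p with p = 4
A_aug, X_aug, E = augment(A, X, rng)
Y = matmul(A, X)
# Y' = Y + E, computed entry by entry.
Y_aug = [[y + e for y, e in zip(row_y, row_e)] for row_y, row_e in zip(Y, E)]
```

Here `Y_aug` agrees with `matmul(A_aug, X_aug)` exactly, since all entries are integers.
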

Using a gaussian (or any continuous) augmentation is convenient, as the resulting matrix is 
obviously full rank. However, it is, in some way, a cheat. Apparently, a gaussian number does not 
have any finite representation, thus it takes forever to read the input, let alone process it. A common 
practice is to truncate (as a matter of fact, the computer only generates a finite approximation of 
the gaussian numbers anyway), and hope that the truncation is fine for our purpose. But then we 
face a non-trivial theoretical question to analyze this approximation. How many decimal places are 
enough? Even if we can prove a guarantee here, using it in practice would require computing with
a matrix with many long entries, which significantly increases the running time. 

We can avoid this problem by using random matrices with discrete distributions, such as ±1. 
The technical issue now is to prove the full rank property. This is a highly non-trivial problem, but
luckily was taken care of in the following result of Bourgain, Vu, and Wood [3]. 

Theorem 5.1. For every $\epsilon > 0$ there exists $\delta > 0$ such that the following holds. Let $N_{f,n}$ be an n
by n complex matrix in which $f$ rows contain fixed, non-random entries and where the other rows
contain entries that are independent discrete random variables. If the fixed rows have co-rank $k$
and if for every random entry $\alpha$, we have $\max_x P(\alpha = x) \le 1 - \epsilon$, then for all sufficiently large $n$

$P(N_{f,n} \text{ has co-rank} > k) \le (1 - \delta)^{n-f}.$

Letting k = 0 and f = m, the result shows that if we augment A by an n x (n - m) random Bernoulli
matrix, this new matrix, A', will be nonsingular with high probability, given that $n - m = \omega(1)$.

We summarize our reasoning in the following algorithm. 
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Figure 1. Rectangular A with n > m 


Algorithm 3 Rectangular Algorithm 

1: Generate an (n - m) x p matrix Z with iid random variables that agree with the model for X.
2: Generate an n x (n - m) matrix B with iid entries (either Gaussian or Rademacher).

3: Run ER-SpUD on Y' = Y + BZ.

4: Remove the columns of A' that come from B and the rows of X' that come from Z from the output of ER-SpUD.


6. Optimal bound for very sparse random matrices 


In this section, we discuss Theorem 1.6. We present a simple algorithm (see below) and use this
algorithm to prove Theorem 1.6, obtaining the optimal bound $p = Cn \log n$.


Algorithm 4 Very-sparse Algorithm 

1: Partition the columns of Y into a minimum number of groups $G_i$ whose members are multiples
of each other.

2: Choose representatives of those $G_i$ with more than two members to be the columns of A up to
scaling.
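
A minimal Python sketch of Algorithm 4 follows. It assumes that Y is given as a list of columns with exact (integer or rational) entries, so that proportionality can be tested exactly; with floating-point data a tolerance would be needed. Zero columns are skipped, which is a choice made here and not specified in the algorithm. The names `direction` and `very_sparse_dictionary` are hypothetical.

```python
from fractions import Fraction


def direction(col):
    """Canonical representative of the line spanned by col (None for a zero column)."""
    lead = next((v for v in col if v != 0), None)
    if lead is None:
        return None
    # Dividing by the first nonzero entry maps all multiples to the same key.
    return tuple(Fraction(v) / Fraction(lead) for v in col)


def very_sparse_dictionary(y_cols):
    """Group columns of Y that are multiples of each other and keep one
    representative from every group with more than two members."""
    groups = {}
    for col in y_cols:
        key = direction(col)
        if key is not None:
            groups.setdefault(key, []).append(col)
    return [members[0] for members in groups.values() if len(members) > 2]


# Columns of Y for a dictionary with columns (1, 3) and (2, 4).
y_cols = [[1, 3], [2, 6], [-1, -3], [2, 4], [4, 8], [3, 7], [0, 0], [1, 2]]
recovered = very_sparse_dictionary(y_cols)
```

In this example the column (3, 7), a combination of both dictionary columns, forms a group of size one and is discarded, while each true column is recovered up to scaling.
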


Proof of Theorem 1.6. Since A is nonsingular, any two columns of Y that are multiples of each
other must be linear combinations of the same columns of A. For a group $G_i$ to have more than
two members would require that there be more than two columns in X with their non-zero entries
in the same rows.


Definition 6.1. We say that a set of columns are aligned if they each have more than one nonzero 
entry and their non-zero entries occur in the same positions. 

Lemma 6.2. The probability that X has more than two aligned columns is $o(1)$.

Thus, the algorithm is likely to yield only columns of A. We now need to show that all the
columns of A will be output with high probability.


Definition 6.3. We say the column $a$ of A is $k$-represented if some group $G_i$ consists of multiples
of $a$ and $|G_i| = k$. In particular, if no multiple of the $j$th column, $a_j$, shows up in the columns of
Y then $a_j$ is 0-represented. A column is well represented if it is $k$-represented for $k > 2$.


Notice that the algorithm will output a multiple of every column that is well represented. 
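
Since A is nonsingular, a column of Y is a nonzero multiple of $a_j$ exactly when the corresponding column of X is a nonzero multiple of $e_j$, so the representation counts of Definition 6.3 can be read off from the supports of the columns of X. The following sketch, with hypothetical function names, does this for a matrix given as a list of columns.

```python
def representation_counts(x_cols, n):
    """counts[j] = number of columns of X that are nonzero multiples of e_j."""
    counts = [0] * n
    for col in x_cols:
        support = [i for i, v in enumerate(col) if v != 0]
        if len(support) == 1:
            counts[support[0]] += 1
    return counts


def well_represented(x_cols, n):
    """Indices j for which a_j is k-represented with k > 2."""
    return [j for j, k in enumerate(representation_counts(x_cols, n)) if k > 2]


x_cols = [[1, 0, 0], [2, 0, 0], [-1, 0, 0], [0, 1, 0],
          [0, 1, 1], [0, 0, 0], [0, 3, 0]]
counts = representation_counts(x_cols, 3)
good = well_represented(x_cols, 3)
```

In this example the first column of A is 3-represented, the second is 2-represented and the third is 0-represented, so only the first one is well represented.
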


The following lemma finishes the proof of Theorem 1.6.


Lemma 6.4. The probability that every column $a_j$ is well represented is $1 - o(1)$.

6.1. Proofs of Sparse Algorithm. Proof of Lemma 6.2. Given the choice of $\theta$, we know that
the number of nonzero entries in any column of X will converge to the Poisson distribution. We
ignore the $o(1/n)$ error terms from this approximation in later calculations to alleviate clutter. To
calculate the probability, we condition on the number of nonzero entries, and then we bound the
probability that three specific columns have the required property, and finally we use the union
bound. This yields an upper bound of

n\ 1 


E 

k>2 


[kir (lY 


= 0 ( 1 ) 


□ 


Proof of Lemma 6.4. By the union bound,

$P(\exists\, i \text{ such that } a_i \text{ is not well represented}) \le n\, P(a_1 \text{ is not well represented}).$

Partitioning into disjoint events yields

$P(a_1 \text{ is not well represented}) = \sum_{j=0}^{2} P(a_1 \text{ is } j\text{-represented}).$

Notice that a multiple of $a_1$, say $\alpha a_1$, appears as a column of Y if and only if
$\alpha e_1 = (\alpha, 0, 0, \ldots, 0)^T$, with $\alpha \ne 0$, is $X_j$, the $j$th column of X, for some $j$. Now, using the
Poisson approximation we can bound each term in the sum. For example, for the probability of being
0-represented, we can divide into the case that a column of X does not have exactly one non-zero
element and the case that it has exactly one non-zero term but not in the first row. We use $C$ to
indicate an absolute constant which may change with each appearance.


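$P(a_1 \text{ is 0-represented}) \le \Big( (1 - ce^{-c}) + ce^{-c}\, \frac{n-1}{n} \Big)^{p} \le C \exp(-Cp/n).$

Similarly,

$P(a_1 \text{ is 1-represented}) \le p\, \frac{ce^{-c}}{n} \Big( (1 - ce^{-c}) + ce^{-c}\, \frac{n-1}{n} \Big)^{p-1} \le C \exp(-Cp/n)$

and

$P(a_1 \text{ is 2-represented}) \le \binom{p}{2} \Big( \frac{ce^{-c}}{n} \Big)^{2} \Big( (1 - ce^{-c}) + ce^{-c}\, \frac{n-1}{n} \Big)^{p-2} \le C \exp(-Cp/n).$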


Thus,

$P(\exists\, i \text{ such that } a_i \text{ is not well represented}) \le C \exp(\log n - Cp/n) = o(1)$

for $p = C' n \log n$ for a large enough $C'$.


□ 


7. Proof of Theorem 1.7

7.1. Lemmas Independent of Symmetry. We first state the necessary lemmas from [15] whose
proofs do not use the symmetry of the random variables.

Lemma 7.1. If rank{X) = n, A is nonsingular, and Y can be decomposed into Y = A'X', then 
the row spaces of X', X, and Y are the same. 

The general idea is to show that the sparsest vectors in the row-span of Y are the rows of X.
Since all of the rows of X' lie in the row-span of Y, intuitively, they can be sparse only when they
are multiples of the rows of X. Naively, this is because rows of X are likely to have nearly disjoint
supports. Thus, any linear combination of them will probably increase the number of nonzero
entries.

Lemma 7.2. Let $\Omega$ be an n x p Bernoulli($\theta$) matrix with $1/n < \theta < 1/4$. For each set $S \subset [n]$,
let $T_S \subset [p]$ be the indices of the columns of $\Omega$ that have at least one non-zero entry in some row
indexed by $S$.

(a) For every set $S$ of size 2,

$P(|T_S| \le (4/3)\theta p) \le$ exp

(b) For every set $S$ of size $\sigma$ with $3 \le \sigma \le 1/\theta$,

$P(|T_S| \le (3\sigma/8)\theta p) \le$ exp

(c) For every set $S$ of size $\sigma$ with $1/\theta \le \sigma$,

$P(|T_S| \le (1 - 1/e)p/2) \le$ exp

7.2. Generalized Lemmas. We will use a result of [14].


Lemma 7.3. Let $\xi_1, \ldots, \xi_n$ be independent centered random variables with variances at least 1 and
fourth moments bounded by $B$. Then there exists $\nu \in (0,1)$ depending only on $B$, such that for
every coefficient vector $a = (a_1, \ldots, a_n) \in S^{n-1}$ the random sum $S = \sum_{i=1}^{n} a_i \xi_i$ satisfies

$P(|S| < 1/2) \le \nu.$

Definition 7.4. We call a vector $\alpha \in \mathbb{R}^n$ fully dense if for all $i \in [n]$, $\alpha_i \ne 0$.

Lemma 7.5. For $b > s$, let $\Pi$ be an s-by-b matrix with at least one nonzero in each column. Let $R$ be
an s-by-b matrix with independent centered random variables with variances at least 1 and bounded
fourth moments. Define $U = \Pi \odot R$. Then the probability that the left nullspace of $U$ contains a
fully dense vector is at most $2^{-b + s \log(e^2 b/s)}$.


Proof of Lemma 7.5. Let $U = [u_1 | \ldots | u_b]$ denote the columns of $U$ and for each $j \in [b]$, let
$N_j$ be the left nullspace of $[u_1 | \ldots | u_j]$. We show that with high probability $N_b$ cannot contain a
fully dense vector. This can be done by showing that if $N_{j-1}$ contains a fully dense vector then
with probability at least 1/2 the dimension of $N_j$ is less than the dimension of $N_{j-1}$. Formally,
consider a fully dense vector $\alpha \in N_{j-1}$. If $u_j$ contains only one nonzero entry, then
$\alpha^T u_j \ne 0$, reducing the dimension of $N_j$. If $u_j$ contains more than one non-zero entry, then
Lemma 7.3 implies that the probability, over the choice of entries of $R_j$, that $\alpha^T u_j = 0$ is less
than 1/2.

Note that the dimension cannot decrease more than $s$ times. For $N_b$ to contain a fully dense
vector, there must be at least $b - s$ columns for which the dimension of the nullspace does not
decrease. Let $F \subset [b]$ have size $b - s$. The probability that for every $j \in F$, $N_{j-1}$ contains a fully
dense vector and that the dimension of $N_j$ equals the dimension of $N_{j-1}$ is at most $2^{-b+s}$. By
the union bound, the probability that $N_b$ contains a fully dense vector is at most

$\binom{b}{b-s}\, 2^{-b+s} \le 2^{-b + s \log(e^2 b/s)}.$

□

The proofs of the following lemmas are identical to those in [15] except that they now use our
more general Lemma 7.5 along with the lemmas in the previous section.

Lemma 7.6. For $t > 200s$, let $\Pi \in \{0,1\}^{s \times t}$ be any binary matrix with at least one nonzero
in each column. Let $R \in \mathbb{R}^{s \times t}$ be a random matrix whose entries are iid random variables, with
$P(R_{ij} = 0) = 0$, and let $U = \Pi \odot R$. Then, the probability that there exists a fully-dense vector $\alpha$
for which $\|\alpha^T U\|_0 \le t/5$ is at most

Lemma 7.7. If $X = \Omega \odot R$ follows the Bernoulli-Subgaussian model with $P(R_{ij} = 0) = 0$,
$1/n < \theta < 1/C$ and $p > Cn \log n$, then the probability that there is a vector $\alpha$ with support of size
larger than 1 for which

$\|\alpha^T X\|_0 \le (11/9)\theta p$

is at most $\exp(-c\theta p)$, where $C, c$ are numerical constants.


7.3. Proof of Theorem 1.7. Say Y can be decomposed as A'X'. From Lemma 7.7 we know
that with probability at least $1 - \exp(-c\theta p)$, any linear combination of two or more rows of X has at
least $(11/9)\theta p$ nonzeros. By a simple Chernoff bound, the probability that any row of X has more
than $(10/9)\theta p$ nonzero entries is bounded by $n \exp(-\theta p/243)$. Thus, the rows of X are likely the
sparsest in row(X).

On the previous event of probability at least $1 - \exp(-c\theta p)$, X does not have any left null vectors
with more than one nonzero entry. Therefore, if the rows of X are nonzero, X will have no nonzero
vectors in its left nullspace. The probability that all of the rows of X are nonzero is at least
$1 - n(1 - \theta)^p \ge 1 - n \exp(-cp)$. From this, by Lemma 7.1, we get row(X) = row(Y) = row(X').
Hence, we can conclude that every row in X' is a scalar multiple of a row of X. □


8. Numerical Simulations 

We demonstrate that the efficiency of the ER-SpUD algorithm is not improved with larger p
values beyond the threshold conjectured. In Figure 2 we have chosen A to be an n x n matrix of
independent N(0, 1) random variables. The n x p matrix X has k randomly chosen non-zero entries
which are Rademacher. The graph on the left of Figure 2 is generated with $p = 5n \log n$ and the
one on the right with $p = 5n^2 \log^2 n$. For both graphs, n varies from 10 to 60 and k from 1 to 10.
Accuracy is measured in terms of relative error:

$re(A', A) = \min_{\Pi, \Lambda} \|A' \Lambda \Pi - A\|_F / \|A\|_F$

The average relative error over ten trials is reported. 
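
For small n this relative error, minimized over column permutations and column scalings of A', can be evaluated by brute force. In the sketch below, which is an illustration and not the code used for the experiments, each permutation matches column j of A with one column of A', and the best scaling for a matched pair is the one-dimensional least-squares solution. The function names are hypothetical, and the cost grows like n!, so this is only usable for very small dictionaries.

```python
from itertools import permutations
from math import sqrt


def column(mat, j):
    return [row[j] for row in mat]


def residual_sq(u, v):
    """min over scalars lam of ||lam * u - v||^2 (least squares in one variable)."""
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    if uu == 0:
        return vv
    uv = sum(x * y for x, y in zip(u, v))
    return max(vv - uv * uv / uu, 0.0)


def relative_error(a_hat, a):
    """Brute-force minimum over column permutations and scalings of a_hat."""
    n = len(a[0])
    # cost[i][j]: squared residual when column i of a_hat is matched to column j of a.
    cost = [[residual_sq(column(a_hat, i), column(a, j)) for j in range(n)]
            for i in range(n)]
    best = min(sum(cost[perm[j]][j] for j in range(n))
               for perm in permutations(range(n)))
    norm_a = sqrt(sum(x * x for row in a for x in row))
    return sqrt(best) / norm_a


A = [[1, 2], [3, 4]]
A_hat = [[4, -1], [8, -3]]   # columns of A, swapped and rescaled
err = relative_error(A_hat, A)
```

Because the Frobenius norm splits over columns, the scaling can be optimized separately for each matched pair once the permutation is fixed.
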


[Figure 2: two panels, "ER-SpUD Small Sample" and "ER-SpUD Large Sample"; horizontal axis: dictionary size n; scale label: "Rel. Error".]

Figure 2. Mean relative errors of ER-SpUD with $p = 5n \log n$ versus $p = 5n^2 \log^2 n$

We then ran our Algorithm 4 in a sparse regime to compare its performance with that of ER-SpUD
(see Figure 3). A was as before, but since our algorithm relies on the appearance of 1-sparse
columns in X, we cannot fix sparsity as in our first experiments. Rather, we vary the Bernoulli
parameter $\theta$ from 0.02 to 0.18, and the $X_{ij}$ are Rademacher. One can see the expected phase
transition at which point the matrix X is no longer sparse enough for our algorithm. In the regime
for which the algorithm was designed, the relative error of our output is on the same order as that
of ER-SpUD. Furthermore, our algorithm runs much quicker and has no trouble with inputs of size
up to n = 500. (The numerical experiments were completed on a MacBook Pro.)

Finally, we compare the outcome of our optimal p value with that of a much larger sample size
($p = O(n^2 \log^2 n)$). We let n range from 10 to 200 and $\theta$ from 0.01 to 0.08. Figure 4 shows that the
efficacy of the algorithm is not much improved despite the dramatic increase in p. The threshold
for failure is identical.


[Figure 3: two panels, "Sparse Algorithm" and "ER-SpUD"; horizontal axis: dictionary size n (ticks 10, 30, 50); scale label: "Rel. Error".]

Figure 3. Mean relative errors with varying sparsity $\theta$. Here, $p = 5n \log n$.


[Figure 4: two panels, "Sparse Alg Small Sample" and "Sparse Alg Large Sample"; horizontal axis: dictionary size n (ticks 50 to 200); scale "Rel. Error" from 0 to 1.]

Figure 4. Mean relative errors of Algorithm 4 with $p = 5n \log n$ versus $p = 5n^2 \log^2 n$